ABSTRACT

Twelve tests of reading and math at the elementary
level are classified according to a model which makes a distinction
between criterion and domain tests. Score reporting and item analysis
techniques are discussed. It is argued that most
objectives-referenced tests do not specify their domains sufficiently
to make interpretations more general than the test items themselves.
(Author)
Score Reporting and Item Selection in Selected
Criterion Referenced and Domain Referenced Tests

Carolyn H. Denham

California State University, Long Beach

When we were busy creating the distinction between norm referenced
and criterion referenced tests, we could overlook the difficulties with
our definitions of criterion and domain referenced tests. Now it is
time to make a distinction between criterion referenced tests and domain
referenced tests. The present situation is too confusing. The following
definition is an example (Glaser, 1971, p. 41):

A criterion-referenced test is one that is deliberately
constructed to yield measurements that are directly interpretable
in terms of specified performance standards. Performance
standards are generally specified by defining a class or domain
of tasks that should be performed by the individual. Measurements
are referenced directly to this domain for each individual.
Measurements are taken on representative samples of tasks drawn
from this domain and such measurements are referenced directly
to this domain for each individual measured.

The definition mixes two types of test interpretation. The first is an
evaluative interpretation in which a score is evaluated in terms of
performance standards or criteria. The second is a descriptive
interpretation in which a score is evaluated in terms of the domain of tasks
represented by the items on the test.

In this paper only those tests which compare the raw scores to
performance standards will be called criterion referenced tests (or,
more simply, criterion tests). Those tests in which scores on a
representative sample of a clearly defined domain of items are used
to estimate scores on the entire domain will be called domain referenced
tests (or domain tests).
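
The two interpretations can be contrasted in a minimal sketch. It borrows the spelling figures from Table 1 (9 of 10 sampled words correct, a criterion of 80%); the function names are illustrative only and do not come from any of the tests reviewed.

```python
def criterion_interpretation(num_correct, num_items, criterion):
    """Evaluative reading: is the observed proportion at or above the standard?"""
    return num_correct / num_items >= criterion


def domain_interpretation(num_correct, num_items):
    """Descriptive reading: the sample proportion estimates the domain score.

    This is only defensible when the items are a representative sample
    of a clearly defined domain.
    """
    return num_correct / num_items


# 9 of 10 randomly chosen spelling words correct, criterion of 80%.
met_criterion = criterion_interpretation(9, 10, 0.80)
domain_estimate = domain_interpretation(9, 10)
print(met_criterion, domain_estimate)
```

The same raw score feeds both functions; what differs is the referent outside the test, a performance standard in the first case and a domain description in the second.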

Of course, there may be tests which combine aspects of domain
and criterion referencing. In an earlier paper Denham (1975)
developed a model for test classification. The model provides for
seven categories of tests: criterion, domain, norm, and the four
possible combinations of the three primary categories. (See
Figure 1.) Examples of test score interpretations for each of the
seven test categories are given in Table 1.
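
The count of seven follows from simple arithmetic: three primary categories yield three single categories, three pairs, and one triple. A short sketch enumerates them; the labels are shorthand chosen here for illustration.

```python
from itertools import combinations

# The three primary categories of the model.
PRIMARY = ("norm", "domain", "criterion")

# Every non-empty combination of the primary categories.
categories = [
    " + ".join(combo)
    for size in range(1, len(PRIMARY) + 1)
    for combo in combinations(PRIMARY, size)
]

for label in categories:
    print(label)
```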

The present paper classifies selected criterion and domain tests
according to the seven categories of the model for the dual purpose
of testing the adequacy of the model and describing the current state
of test development. If the tests fail to fit into the categories of
the model, an attempt will be made to determine whether the model or
the tests are in need of improvement.

Selected for review are tests or testing systems in math and
reading for the elementary grades which are labeled criterion referenced,
domain referenced, instruction referenced, or objectives referenced. Some
are simply tests; others are testing systems from which the particular
tests or items are selected for a given administration. For convenience
the word test will include both tests and testing systems. Those tests
which report group scores rather than individual scores are omitted
from the discussion. Thus the exemplary domain testing in the
MINNEMAST project (Hively et al., 1973) and other tests employing matrix
sampling procedures are not discussed here.

A list by Kosecoff and Fink (1976) was the source of many of the
test titles. Others were located through minor detective work. The
list of tests is not intended to be exhaustive although the author
has attempted to review most of the widely available tests or testing
systems. Comments on the tests are not intended to serve as critiques
of the individual tests since only certain aspects of the tests are
discussed. Readers should look elsewhere for comprehensive critiques
of each of the tests.



Test Classification and Score Reporting

In Tables 2, 3, and 4 are the classification and score reporting
systems for those tests which fit into the seven categories of the
model. Figure 2 illustrates the fact that six tests were classified
as criterion, three as domain, two as norm + criterion, and one as
domain + criterion. A test was classified criterion referenced if
a criterion level was set and scores were reported as above or below
a criterion. A test was considered domain referenced if its domain
specifications were reasonably precise and/or its items were
considered samples of a domain. A test was classified norm referenced
if the test provided transformed scores such as percentiles or
standard scores. Simply providing data on the performance of groups (such
as that provided by the IOX Objectives-Based Tests and by the EDITS
Tests of Achievement in Basic Skills) was not considered sufficient
to label a test norm referenced. Indeed, most criterion tests
provide some kind of group data in the form of classroom, school, or
district performance.
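
The classification rules just stated can be written out as a small sketch. The feature names below are hypothetical shorthand for the three conditions in the text; the actual classification rested on the author's judgment of each manual and not on any mechanical procedure.

```python
from dataclasses import dataclass


@dataclass
class ScoreReportingFeatures:
    # A criterion level is set and scores are reported as above or below it.
    reports_against_criterion: bool
    # Domain specifications are reasonably precise and/or items sample a domain.
    domain_specified_or_sampled: bool
    # Transformed scores (percentiles, standard scores) are provided.
    # Group performance data alone does not count.
    transformed_scores: bool


def classify(features):
    """Return the model category implied by the three conditions."""
    labels = []
    if features.transformed_scores:
        labels.append("norm")
    if features.domain_specified_or_sampled:
        labels.append("domain")
    if features.reports_against_criterion:
        labels.append("criterion")
    return " + ".join(labels) if labels else "unclassified"


# A test with a cutoff and classroom averages, but no percentiles.
print(classify(ScoreReportingFeatures(True, False, False)))
```

A test that satisfies none of the three conditions falls outside the model, which is the situation of the objectives referenced tests discussed later in this section.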

The items in a domain test may be written from item forms,
defined by Hively, Patterson, and Page (1968) as rules for generating
sets of test items. Hsu (1972) reports the use of item forms for
some of the math tests in Individually Prescribed Instruction.
Popham (1975) reports a simpler method of defining domains using
amplified objectives as they are used in the Objectives-Based Tests
of the Instructional Objectives Exchange (IOX). Even less structured
is the system used by Comprehensive Achievement Monitoring (CAM),
which was classified as domain primarily
because longitudinal data are obtained through repeated samples of
the items within each objective.
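
The idea of an item form as a rule for generating items can be shown with a deliberately simple sketch. The form below (single-digit addition) is a hypothetical example and is far simpler than the item forms of Hively, Patterson, and Page; the fixed random seed is used only to make the sample repeatable.

```python
import random
from itertools import product


def addition_item_form(low=0, high=9):
    """Hypothetical item form: every item 'a + b = ?' with low <= a, b <= high.

    Returns the full domain as (stem, answer) pairs.
    """
    return [
        (f"{a} + {b} = ?", a + b)
        for a, b in product(range(low, high + 1), repeat=2)
    ]


domain = addition_item_form()                      # the whole domain
test_items = random.Random(0).sample(domain, 10)   # one randomly drawn test

print(len(domain), len(test_items))
```

Because the rule defines the whole domain, a score on the sampled items can be referred back to a population of items that is known in advance.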



Not all of the tests examined fit into one of the seven categories
of the model. The Diagnostic Math Inventory (CTB/McGraw-Hill, 1975)
has only one item per objective. Hively (1974, p. 140) discusses
the situation of a test with only one item per objective:

...the inference from the item score to the domain score is
primitive; it only tells you about the probability that the
students will respond correctly to this same item if you
present it again.

If you want stronger inference, you can construct more
items for each objective, and then you can sample some of
them and estimate the probability that [the student] or
group will respond correctly to the others. That is
the only difference between a domain-referenced test and
an objective-referenced test. The strength of the inference
depends on the representativeness of the set of items
associated with each objective.

Thus the Diagnostic Math Inventory does not fit the present definition
of domain testing.

Nor was the Diagnostic Math Inventory classified as a criterion
test. One could argue that there is an implied criterion of a correct
response to the single item representing each objective, but such a
criterion adds little to the test interpretation that could be
achieved by simply examining the test itself. Indeed, the most basic
interpretation of a score is simply to examine the test items. All
of the categories in the model, however, are intended to refer to
ways in which a raw score can be given meaning by referencing it to
something outside the test: a norm group, a criterion level, or a
domain description.



The Key Math Diagnostic Arithmetic Test (American Guidance
Service, 1976) also depends on scores on single items for its
"criterion-referenced" interpretations. Thus, like the Diagnostic
Math Inventory, it was classified as neither criterion nor domain.
If norm referenced tests had been reviewed in this paper, the Key
Math could have been classified as a norm test. Its method for
producing the norm referenced scores, using Rasch-Wright procedures,
is most sophisticated.

Another type of test which does not fit the model is the
objectives referenced test in which items are keyed to objectives but
the objectives are not adequately precise to serve as domain
definitions. In most of these tests, even if they have more than one item
per objective, the best way to interpret the scores is to examine
the test items. Examining the objectives may be misleading because
their lack of specificity makes it appear that the test is more
comprehensive than examination of the items reveals. Since the
objectives in the Individual Pupil Monitoring System for Reading and
Math (Houghton-Mifflin, 1973) were evidently not intended to serve
as domain statements and no criterion level was set, the test was
not placed into any of the categories.

Becoming popular are tailor-made tests in which the user selects
objectives from a list and test items are compiled to meet the user
specifications. Examples are the ORBIT system by CTB/McGraw-Hill
and many of the computer test banks discussed by Lippey (n.d.).
Since these typically have neither domain statements nor criterion
levels, they will generally not fit into the categories of the model.
Also the reader should be aware that the items in such banks,
particularly those not produced by test publishers, rarely undergo the
same scrutiny as the unitary tests produced by test publishers.
Evaluation of the Model

It is evident that many tests do not fit the author's definitions
of domain and criterion testing. Does this mean the model is
inadequate? No, in the author's opinion, it reflects the state of the
art in test development. The model lists ways of interpreting a
score by referencing the score to something outside the test: a norm
group, a criterion level, or a domain description. Many current tests
are most appropriately interpreted by simple examination of their
items; even if there is a list of objectives, interpreting the score
in terms of the objectives may be making unwarranted generalizations
since many objectives are not specific enough to describe adequately
the items.

Interpretation of a test score by examining the items is a very
useful procedure. However, there are two difficulties. The first
problem, often exaggerated, is the need to keep the items secret.
Fortunately there are many instances in which the students, teachers,
administrators, or parents may examine the items after a test
administration. In other instances, such as with a large bank of items, the
items may be examined before the test administration.

The second problem is more serious. It is the fact that test
developers and users want to make statements at a higher level of
generality than the test itself. This is what makes the theory of
educational and psychological measurement more complex than that of
physical measurement. It is for the task of making generalizations
from specific items that the procedure of domain testing is most
promising.
Item Analysis

The model suggests a need for different item analysis procedures
for each of the test categories. The first steps in an item analysis
procedure can be similar for all types of tests. Whether a test is norm,
criterion, or domain, the items must be free of faults such as those
listed in tests and measurements books. Computing difficulty and
discrimination indexes and discussing the items with students are
among the methods which can detect faulty items. A second measure
of an item is its content validity. A test developer may consult
experts for judgments about the appropriateness of each item, or the
developer may use empirical techniques such as examining
intercorrelations among items measuring the same objective.
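
For readers unfamiliar with the two indexes, the following is a minimal sketch under common textbook definitions, which the paper itself does not spell out: difficulty is taken as the proportion of examinees answering an item correctly, and discrimination as the difference in that proportion between the upper and lower halves of examinees ranked by total score. The response matrix is invented for illustration, and the sketch assumes at least two examinees.

```python
def difficulty(responses, item):
    """Proportion of examinees answering the item correctly (1 = correct)."""
    return sum(row[item] for row in responses) / len(responses)


def discrimination(responses, item):
    """Upper-half minus lower-half difficulty, ranking examinees by total score."""
    ranked = sorted(responses, key=sum, reverse=True)
    half = len(ranked) // 2
    upper, lower = ranked[:half], ranked[-half:]
    return difficulty(upper, item) - difficulty(lower, item)


# Rows are examinees, columns are items (invented data).
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
]

for item in range(3):
    print(item, difficulty(responses, item), discrimination(responses, item))
```

An item whose discrimination is near zero or negative, like the third item in this invented matrix, is a candidate for closer inspection for item-writing faults; the index flags the item but does not say what is wrong with it.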

Finally, the developer must select among those items which are
well written and appropriate in their content. Such selection is
usually necessary since there are practical limitations on the length
of the test. The most efficient way of testing is to select items
which contribute the most to the type of score to be reported. It
is for efficiency that medium difficulty items are selected for norm
tests. Survey research techniques may improve efficiency when
sampling from a domain. For example, content areas in which the
measurements may be less reliable can be oversampled and those areas in
which correlations among items are higher can be undersampled.
Efficiency of criterion tests might be improved by concentrating on
items near the difficulty level of the criterion, particularly in
those tests which can be scaled according to difficulty level.
Efficiency on any test could be increased if those items which are
most cost-effective in terms of time are selected; this would take
into consideration the fact that some types of test items take more
of the subject's time than do short items such as true/false items.
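
One way to read the suggestion about criterion tests is sketched below: given items already placed on a common difficulty scale, keep those closest to the point on the scale that corresponds to the criterion. The scale values, the item labels, and the selection rule are all assumptions made for illustration; none of the tests reviewed is known to have used this exact procedure.

```python
def select_near_criterion(item_difficulties, criterion_level, k):
    """Return the k item labels whose scaled difficulty is nearest the criterion."""
    ranked = sorted(
        item_difficulties,
        key=lambda label: abs(item_difficulties[label] - criterion_level),
    )
    return ranked[:k]


# Hypothetical scaled difficulties for five items.
scaled = {"A": -1.5, "B": -0.25, "C": 0.0, "D": 0.5, "E": 2.0}

chosen = select_near_criterion(scaled, criterion_level=0.25, k=3)
print(chosen)
```

Items far from the criterion (A and E in this invented set) say little about whether a student is just above or just below the cutoff, which is the sense in which dropping them improves efficiency.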

Other suggestions for item analysis for criterion and domain tests
can be found in Denham (1975).

Item analysis procedures for each of the twelve tests were
obtained through study of published manuals and through
correspondence and conversations with the test publishers or developers. Item
analysis information was available for all of the tests except one.
The following are the findings:

1) Tests falling into different categories of the model
did not have distinguishable patterns of item analysis
procedures.

2) Most of the test developers arranged for the items to
be reviewed by experts in addition to empirical procedures,
if any.

3) Item analysis techniques which might reveal faults in
item-writing were rarely used. This was to be expected
since many of the writings on criterion and domain testing
disparage item discrimination and difficulty indexes
although they can be quite useful in detecting item faults.
In the manuals of two of the tests, however, methods for
detecting poorly written items were discussed separately
from other item analysis steps.

4) Some tests experimented with item analysis techniques not
usually employed with norm tests. Among these techniques
were sensitivity to instruction, discrimination among
mastery and nonmastery groups, and a variety of procedures
for evaluating the difficulty level of the items. In one
test for grades 4-6 only those items which were easy for
the sixth graders were chosen. In three tests, items with
similar levels of difficulty were chosen to represent an
objective. In two tests, items with varying levels of
difficulty were chosen to represent an objective. In
another test, items were chosen such that they were neither
"too hard nor too easy" for the tryout group.

5) The item analysis procedure for one test consisted of
administering the test to a few students and discussing the items
with them.

6) The item analysis procedure of another test consisted of
administering the test in one classroom to determine if
all items measuring the same objective produced similar
results.

7) In one test, items which fit the Rasch-Wright model were
selected.

8) In one test, item forms, rather than the items, were
subjected to analysis.
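
Two of the techniques named in finding 4 can be sketched as simple differences in proportions. The definitions used here are assumptions, since the test manuals' own formulas are not reported in this paper: sensitivity to instruction is taken as the gain in proportion correct from before to after instruction, and mastery discrimination as the difference in proportion correct between students judged masters and nonmasters. The data are invented.

```python
def proportion_correct(scores):
    """Proportion of 1s in a list of 0/1 item scores."""
    return sum(scores) / len(scores)


def sensitivity_to_instruction(pre_scores, post_scores):
    """Gain in proportion correct on one item from pre- to post-instruction."""
    return proportion_correct(post_scores) - proportion_correct(pre_scores)


def mastery_discrimination(master_scores, nonmaster_scores):
    """Proportion correct among masters minus that among nonmasters."""
    return proportion_correct(master_scores) - proportion_correct(nonmaster_scores)


# Invented scores on a single item.
pre, post = [0, 0, 1, 0], [1, 1, 1, 0]
masters, nonmasters = [1, 1, 1, 1], [0, 1, 0, 0]

print(sensitivity_to_instruction(pre, post))
print(mastery_discrimination(masters, nonmasters))
```

Both indexes speak to whether an item separates instructed from uninstructed students; neither replaces a check for item-writing faults.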

In summary, an impressive variety of techniques was employed.
However, there was scant attention paid in most of the reports of
item analyses to the detection of possible item faults. Additionally,
the methods of item selection used by some of the tests were almost
the opposite of those used by other tests in the same category. For
example, some tests sought uniform difficulty levels; others sought
variety in difficulty levels. It is almost impossible to evaluate
these selection methods without a systematic method of determining
the purposes for which the item analysis was used.

To help clear up the confusion, it is proposed that all criterion
and domain tests perform each of three kinds of item analysis
procedures:

1) An examination of the accuracy of the items, the extent
to which the items are free of item writing faults such
as those listed in tests and measurements textbooks.

2) An examination of the content of the items, the extent to
which the items are representative of the objectives or
the domain.

3) An examination of the efficiency of the items, the extent
to which the items contribute information to the criterion
or domain decision.

Attention to each of these three types of item analysis data would
mean that criterion and domain test developers would no longer
neglect examination of item accuracy. It should also help test
developers think more clearly about the purposes of their item analysis
procedures and to invent new methods of item analysis, whether empirical
Other Technical Considerations

Perfecting item analysis techniques for domain and criterion
tests is only one of the many tasks remaining for researchers. The
issues of reliability, validity, estimation of domain scores, and
estimation of mastery states are among the current problems.

Livingston (1972), Huynh (1976), and Swaminathan (1974) report
methods of computing reliability applicable to criterion testing.
We are in need of research on methods of computing reliability of
domain estimates.

Meskauskas (1976) discussed at length the problem of cut-off
scores for criterion tests. We are in need of research on the
estimation of domain scores from test items. Although the problem
may seem a simple one, it is actually quite complex. Two of the
models which have been proposed for estimation of domain scores are
the binomial model (Millman, 1974), in which the percentage of items
answered correctly on the test is taken as a point estimate of the
domain score and group data are not considered, and the classical
testing model, in which the estimated domain score is a regressed
score utilizing data on group performance. These two models were
criticized by Haladyna (1975). Another procedure for estimating
domain scores is the Bayesian approach in which group data or other
data may be used as prior information (Lewis et al., 1973; Novick
et al., 1973). The Rasch-Wright model and Cronbach's theory of
generalizability are two other models which could provide domain
score estimates.
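
The contrast among the first three approaches can be made concrete with a hedged sketch. The binomial estimate is the proportion correct, as described above. For the classical model the sketch assumes the familiar regressed-score form (observed proportion weighted by reliability, group mean weighted by its complement). For the Bayesian approach it assumes a beta prior on the domain score, which may differ from the formulations of Lewis et al. and Novick et al. All numerical values are invented.

```python
def binomial_estimate(num_correct, num_items):
    """Point estimate of the domain score: proportion of sampled items correct."""
    return num_correct / num_items


def regressed_estimate(num_correct, num_items, group_mean, reliability):
    """Observed proportion pulled toward the group mean (assumed classical form)."""
    observed = num_correct / num_items
    return reliability * observed + (1 - reliability) * group_mean


def beta_binomial_estimate(num_correct, num_items, prior_a, prior_b):
    """Posterior mean of the domain score under an assumed Beta(prior_a, prior_b) prior."""
    return (prior_a + num_correct) / (prior_a + prior_b + num_items)


# A student answers 9 of 10 sampled items; the group averages 70% correct.
print(binomial_estimate(9, 10))
print(regressed_estimate(9, 10, group_mean=0.70, reliability=0.5))
print(beta_binomial_estimate(9, 10, prior_a=7, prior_b=3))
```

With these invented values the two estimates that use group information fall between the student's observed proportion and the group mean, while the binomial estimate ignores the group entirely. How much weight the group data deserve is one of the questions the models answer differently.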



Summary 

The paper describes twelve criterion and domain tests in terms
of a model which makes distinctions between criterion and domain
tests. Score reporting and item analysis procedures are discussed.
The author found few tests which could legitimately be called domain
and many objectives referenced tests which did not fit the model.
This reveals a problem in current test development: although many
tests are interpreted in terms of performance of objectives, these
objectives are often too loosely written to serve as descriptions
of the actual items or, on the other hand, the items are too few and
too homogeneous to represent the more broadly stated objectives. In
place of objectives referenced tests the author advocates domain
referenced tests in which the domains are clearly specified and an
attempt is made to choose items which are representative of the
domains.

Of course, if item forms as complex as those written by Hively
et al. (1973) must be used, many test developers might simply refuse
to try. Specification of and sampling from domains is a matter of
degree. The present author recommends more careful attention to
specification and sampling so that one may interpret the test at a
higher level of generality than the test itself. However, it is
hoped that test developers do not make the task so complex that they
lose themselves in their domains. The English essayist Charles Lamb
(1823) must have been referring to such a situation in his account of
one man: "He was lord of his library, and seldom cared for looking
out beyond his domains."
Figure 2. The placement of twelve reading and math tests within the model.




Table 1

Description of Score Reporting for Each of the Seven Categories of Tests

1) NR: The student scores better than 85% of the norm group.

2) DR: The student correctly spelled 9 out of 10 words randomly
chosen from the list of sixth grade spelling words. It is
estimated that he can spell 90% of the words on the list.

3) CR: The student met the criterion of 80% of the words spelled
correctly.

4) NR + CR: The student scored better than 85% of the norm group
and met the criterion of 80% of the words spelled correctly.

5) NR + DR: It is estimated that the student would score better
than 85% of the norm group on the entire list of sixth grade
spelling words. It is estimated that the student can spell
90% of the words on the list.

6) CR + DR: It is estimated that the student would meet the
criterion of 80% correct on the entire list of sixth grade
spelling words. It is estimated that the student can spell
90% of the words on the list.

7) NR + DR + CR: It is estimated that the student would score
better than 85% of the norm group and would meet the criterion
of 90% correct on the entire list of sixth grade spelling
words. It is estimated that the student can spell 90% of the
words on the list.
Table 2

Score Reporting on Criterion Tests

Test: Skills Monitoring System - Reading
Publisher, date: The Psychological Corporation / Harcourt Brace
Jovanovich, Inc., 1974-75
Test category: CRITERION
Score reporting: On the Skill Locator (a survey) there are two
items per objective. Both must be answered correctly for mastery.
On the shorter Skill Minis (8-12 items), [illegible]% must be
answered correctly for mastery.

Test: Mastery: An Evaluation Tool - SOBAR Reading and Mathematics
Publisher, date: Science Research Associates, 1975
Test category: CRITERION
Score reporting: The mastery tests contain three items per
objective. All three must be answered correctly for mastery.

Test: Fountain Valley Teacher Support System - Reading and
Mathematics
Test category: CRITERION
Score reporting: Seventy-five percent of the items must be
answered correctly for proficiency on an objective. Each test
measures approximately six objectives.

Test: Prescriptive Reading Inventory
Publisher, date: CTB/McGraw-Hill, 1972
Test category: CRITERION
Score reporting: Number of items per objective varies. In the
case of three items, two out of three correct indicates mastery.
For four items, three out of four indicates mastery.

Test: Tests of Achievement in Basic Skills - Mathematics and
Reading
Publisher: Educational and Industrial Testing Service
Test category: CRITERION
Score reporting: Criterion varies according to subject matter
and level. In Level 2 reading, objectives with three items or
less require [illegible]% proficiency. The criterion level is 75%
for objectives with four or more test items. Level B Math has
only one item per objective. A correct response to the item
indicates accomplishment of the objective.

Test: Doren Diagnostic Reading Test of Word Recognition Skills
Publisher, date: American Guidance Service, 1973
Test category: CRITERION
Score reporting: If more than six items are answered incorrectly
in any skill area, remediation is indicated for the area. This
is equivalent to a criterion of 70%.



Table 3

Score Reporting on Domain Tests

Test: Objectives-Based Tests
Publisher, date: Instructional Objectives Exchange (IOX), 1974
Test category: DOMAIN
Score reporting: Number correct is reported for each amplified
objective. There are 5 or 10 items per amplified objective.
Some normative data are available but raw scores are not converted
to normative scores.

Test: Reading Block Assessment and Reading Placement Aid
Publisher, date: SWRL Educational Research and Development /
Ginn & Co., 1976
Test category: DOMAIN
Score reporting: On each of the eight Block Assessments, the
number of items correct on each of four outcomes is reported.
However, the Reading Placement Aid is criterion referenced. The
first page on which the pupil scores 6 or less determines the
suggested block assignment.

Test: Comprehensive Achievement Monitoring (CAM) Systems
Test category: DOMAIN
Score reporting: Implementation varies from system to system,
but those in which items are sampled from domains may be called
domain tests. Scores are reported for each objective. Typically
the objective is tested at several different points in time using
different samples of items to measure the objectives, producing
longitudinal data. See Gorth et al. (1975) and Sension and
Rabehl (1974) for more information.